White Wine Quality Exploration by Jerry Wang

July 26, 2016

Summary of the Data Set

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Our dataset consists of 13 variables and 4898 observations. Wine quality has a median of 6, with a minimum of 3 and a maximum of 9. Some wines contain no citric acid, which can add ‘freshness’ and flavor to wines. Quality is the output attribute; the 11 input variables (based on physicochemical tests) could be relevant to it, so we will explore them in depth.

Univariate Plots Section

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Wine quality is scored from 0 to 10, where 0 is the worst and 10 is the best. The quality histogram looks approximately normal. The best score in the dataset is 9, most wines are scored 5 or 6, and more than 70% of the wines fall into the medium quality class.
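As a quick check on the shares quoted here, the proportions can be recomputed from the counts in the table alone. This is only a sketch; the variable names are mine and nothing is read from the original CSV.

```r
# Counts per quality score, copied from the table printed above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

# Share of wines at each score, and the combined share of scores 5 and 6.
quality_share <- quality_counts / sum(quality_counts)
share_5_or_6 <- sum(quality_share[c("5", "6")])

print(round(100 * quality_share, 1))
print(round(100 * share_5_or_6, 1))
```

Scores 5 and 6 together come to roughly three quarters of the wines, which is consistent with the "more than 70%" statement.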

Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The three plots above, for fixed.acidity, volatile.acidity and citric.acid, all look approximately normal with some outliers. In particular, the maximum fixed.acidity reaches 14.2.

Total Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1527 1527          14.2             0.27        0.49            1.1
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1527     0.037                  33                  156   0.992 3.15
##      sulphates alcohol quality quality.class total.acidity
## 1527      0.54    11.1       6        medium         14.96

I added a new variable called total.acidity, which sums all the acidity variables; its distribution also looks approximately normal. Only one wine in the dataset has total.acidity larger than 14, and its quality is 6. Because the wine brewing details (time, temperature etc.) are unknown, I don’t know what caused that.
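A minimal sketch of how total.acidity could be derived. The plain sum of the three acidity columns is an assumption on my part, inferred from row 1527 above (14.2 + 0.27 + 0.49 = 14.96); the small data frame only holds the first three rows from the str() output, not the full dataset.

```r
# First three rows of the acidity columns, copied from the str() output above.
acids <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# Assumed definition: total acidity is the sum of the three acidity measures.
acids$total.acidity <- with(acids, fixed.acidity + volatile.acidity + citric.acid)

print(acids)
```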

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782           7.8            0.965         0.6           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality quality.class total.acidity
## 2782      0.69    11.7       6        medium         9.365

The distribution of residual.sugar has a long tail on the right side. After a log10 transform, the distribution appears bimodal, with peaks around 1.5 and 7.5. Residual sugar is the amount of sugar remaining after fermentation stops; wines normally have more than 1 gram/liter of sugar, and wines with more than 45 grams/liter are considered sweet. Here the minimum is 0.6 and the maximum is 65.8. The wine with residual sugar 65.8 has quality 6; as with the high total.acidity value, I don’t know what caused it.
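The transform itself is a single call. The sketch below applies it to the first ten residual.sugar values from the str() output and shows where the two reported peaks would sit on a log10 axis, assuming the peak positions are quoted in the original units.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail of a positively skewed variable.
log_sugar <- log10(residual_sugar)

# Positions of the two reported peaks on the log10 axis
# (assumption: 1.5 and 7.5 are in the original units).
peaks_log10 <- log10(c(1.5, 7.5))

print(round(log_sugar, 2))
print(round(peaks_log10, 2))
```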

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides: the amount of salt in the wine. The distribution looks approximately normal; the median is 0.043 and the mean is 0.04577, very close to the median.

Sulfur Dioxide

## [1] "Summary of total.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The histograms for free SO2, total SO2 and the ratio of free SO2 all look approximately normal. Since sulphates can contribute to sulfur dioxide levels, the sulphates histogram is similar to that of total sulfur dioxide.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density has a very small range, from 0.9871 to 1.0390, very close to the density of water. The distribution looks approximately normal.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH: most wines have pH values between 3.0 and 3.4 on the pH scale (from 0, very acidic, to 14, very basic). The distribution looks approximately normal.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol percentage probably affects the density, pH level and flavor of the wine. Looking at the alcohol distributions for the different quality levels, it seems that the higher the alcohol level, the better the quality of the wine.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in the dataset with 13 variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, and index X).

Quality is the output attribute, scored from 0 to 10, where 0 is the worst and 10 is the best. It is originally an integer variable (values: 3, 4, 5, 6, 7, 8, 9). The 11 input variables (excluding X) are all numerical.

Other observations: the best wines are scored 9, and there are only 5 of them, so they are very rare. Most wines are at the median quality level of 6.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality, which may be correlated with some of the physicochemical attributes. I’d like to find out which attributes influence the quality of white wine. I suspect that alcohol and some combination of the other attributes can be used to build a predictive model of wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Acidity, residual.sugar, total.sulfur.dioxide and pH likely contribute to the quality of wines.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable, quality.class (in addition to the total.acidity variable introduced above), and will use it to analyse the correlations between variables in the next two sections.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I transformed the positively skewed residual.sugar distribution with log10. The transformed distribution appears bimodal, with peaks around 1.5 and 7.5.

I added a new factor, quality.class (low, medium and high), so that in the Bivariate and Multivariate sections I can explore the attributes across the different quality groups.
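One way such a factor could be built is with cut(). The cut points below (3 to 4 low, 5 to 6 medium, 7 to 9 high) are my assumption: they are not stated in the report, but they are consistent with the remark that more than 70% of wines are in the medium class. The quality vector is rebuilt from the count table rather than read from the data file.

```r
# Rebuild the quality column from the counts in the frequency table above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Assumed cut points: (2,4] -> low, (4,6] -> medium, (6,9] -> high.
quality.class <- cut(quality,
                     breaks = c(2, 4, 6, 9),
                     labels = c("low", "medium", "high"))

class_counts <- table(quality.class)
print(class_counts)
print(round(100 * prop.table(class_counts), 1))
```

With different cut points the class sizes change considerably, so the breaks are worth stating explicitly wherever quality.class is used.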


Bivariate Plots Section

Plot Matrix

Looking at the plot matrix, which shows the correlation coefficient for each pair of variables, the strongest correlations with quality are with alcohol, density and chlorides (Pearson r: 0.44, -0.31, -0.21). The strongest correlations with alcohol are with density, total.sulfur.dioxide, residual.sugar and chlorides (Pearson r from -0.78 to -0.36).

Quality vs Alcohol

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## [1] 0.4355747
## [1] 0.4675664

Here, the plots show that wines in the medium and high quality.class tend to have higher alcohol values. The boxplot shows that wines with quality 6 to 9 have higher alcohol values; the Pearson correlation is 0.436. When quality is restricted to the range 5 to 9, the correlation is 0.468.
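The dip at quality 5 can be read directly from the per-quality means printed above. The sketch below uses only those means and the quality counts (variable names are mine); the count-weighted average of the group means should land close to the overall alcohol mean of 10.51, which serves as a sanity check on the copied numbers.

```r
# Wines per quality score and mean alcohol per score, copied from the output above.
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
mean_alcohol <- c(10.34, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)
names(mean_alcohol) <- 3:9

# The count-weighted mean of the group means approximates the overall mean.
overall_mean <- weighted.mean(mean_alcohol, counts)

# Quality score with the lowest mean alcohol, and the change between adjacent scores.
lowest <- names(which.min(mean_alcohol))
step_change <- diff(mean_alcohol)

print(round(overall_mean, 2))
print(lowest)
print(round(step_change, 2))
```

Mean alcohol falls from quality 3 to 5 and rises from 5 to 9, which is why restricting the correlation to the range 5 to 9 gives a slightly larger r.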

Quality vs Density

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970
## [1] -0.3071233

Here, the plots of density against quality and quality.class show that wines with quality 5 to 9 (medium to high) tend to have lower density. The boxplot displays the same trend as the scatterplots. The Pearson correlation is -0.307.

Quality vs Chlorides

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## wines$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350
## [1] -0.2099344

Here, the quality vs chlorides scatterplot shows that wines with quality 5 to 9 tend to have lower chlorides, and the boxplot displays the same trend. The Pearson correlation is -0.21.

Alcohol vs Density

## [1] -0.7801376

We can see that alcohol and density have a negative linear relationship when we ignore the outliers. The Pearson correlation is -0.78.

Alcohol vs Total.Sulfur.Dioxide

## [1] -0.4488921

In the first scatterplot, total.sulfur.dioxide values are spread across all levels of alcohol, and a few outliers at the low and high ends skew the trend. After zooming in, the second plot shows a negative trend between the two variables. The Pearson correlation is -0.449.

Alcohol vs Residual.Sugar

## [1] -0.4506312

In general, as residual.sugar increases, alcohol tends to decrease. The Pearson correlation is -0.45.

Alcohol vs Chlorides

## [1] -0.3601887

As chlorides increase within the range 0 to 0.1, alcohol values tend to decrease. The Pearson correlation is -0.36.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Looking at the plot matrix, the strongest correlations with quality are with alcohol, density and chlorides (Pearson r: 0.44, -0.31, -0.21).

For wines with quality in the range 6 to 9 (quality.class medium and high), quality tends to increase as alcohol increases. By contrast, for wines with quality in the range 3 to 5 (quality.class low), quality tends to decrease as alcohol increases.

Density and chlorides show a similar pattern with the opposite sign: both tend to be lower for wines with quality in the range 5 to 9.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, alcohol is correlated with density, residual.sugar and chlorides. All three variables have a negative relationship with alcohol.

What was the strongest relationship you found?

My main purpose is to find which chemical properties influence the quality of wines. After comparing the relationships between quality and the relevant variables, I found that alcohol has the strongest positive relationship with wine quality.

The strongest relationship in the dataset overall is between residual.sugar and density, with a correlation coefficient of 0.84.


Multivariate Plots Section

Alcohol vs Density with Quality as Color

Here, the plots clearly show the negative linear relationship between alcohol and density at all quality levels. Wines with higher quality sit on the right side of the plots, which further shows that higher quality wines tend to have high alcohol and low density.

Alcohol vs Residual.Sugar with Quality as Color

Alcohol vs Chlorides with Quality as Color

As with alcohol vs density, the plots show that alcohol has a negative linear relationship with residual.sugar and chlorides, and that higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values.

Linear Model

## 
## Calls:
## m1: lm(formula = alcohol ~ density, data = wines)
## m2: lm(formula = alcohol ~ density + residual.sugar, data = wines)
## m3: lm(formula = alcohol ~ density + residual.sugar + chlorides, 
##     data = wines)
## 
## =========================================================
##                       m1           m2           m3       
## ---------------------------------------------------------
##   (Intercept)      329.588***   564.755***   544.341***  
##                     (3.657)      (5.365)      (5.626)    
##   density         -320.991***  -558.645***  -537.841***  
##                     (3.679)      (5.414)      (5.684)    
##   residual.sugar                  0.167***     0.159***  
##                                  (0.003)      (0.003)    
##   chlorides                                   -4.614***  
##                                               (0.425)    
## ---------------------------------------------------------
##   R-squared             0.6          0.7          0.8    
##   adj. R-squared        0.6          0.7          0.8    
##   sigma                 0.8          0.6          0.6    
##   F                  7613.4       7302.6       5023.8    
##   p                     0.0          0.0          0.0    
##   Log-likelihood    -5668.6      -4580.9      -4522.6    
##   Deviance           2902.6       1861.6       1817.9    
##   AIC               11343.1       9169.7       9055.2    
##   BIC               11362.6       9195.7       9087.7    
##   N                  4898         4898         4898      
## =========================================================
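To make the m1 column concrete, its two coefficients can be used to predict alcohol from density by hand. This is only a sketch built from the rounded coefficients in the table; predict_alcohol_m1 is a name I made up, not a function from the original analysis.

```r
# m1 from the table above: alcohol = 329.588 - 320.991 * density
# (coefficients are the rounded values printed in the table).
predict_alcohol_m1 <- function(density) {
  329.588 - 320.991 * density
}

# Evaluate at the first quartile, mean and third quartile of density
# from the summary at the top of the report.
densities <- c(q1 = 0.9917, mean = 0.9940, q3 = 0.9961)
predictions <- predict_alcohol_m1(densities)

print(round(predictions, 2))
```

At the mean density the prediction is close to the mean alcohol of 10.51, as expected for a least-squares line, and the predictions fall as density rises.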

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Furthermore, the multivariate analysis revealed that higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values. Since the plots show a linear relationship between alcohol and its related variables (density, residual.sugar and chlorides), I can build a linear model and use it to predict alcohol values.

Were there any interesting or surprising interactions between features?

In the low quality group of wines, alcohol tends to decrease and chlorides tend to increase as quality increases, which is the opposite of the trend in the medium to high quality group.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a very simple linear model starting from alcohol and density.

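According to the table, the variables in the linear model account for roughly 80% of the variance in the alcohol value of wines (R-squared is reported to one decimal place). Adding residual.sugar and then chlorides raises the rounded R-squared from 0.6 to 0.7 and then to 0.8, but because of the rounding these steps should not be read as exact 10-point gains: the deviance drops sharply when residual.sugar is added (2902.6 to 1861.6) and only slightly when chlorides is added (1861.6 to 1817.9).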
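Because the table rounds R-squared to one decimal, a finer estimate can be recovered from numbers already printed in this report. The sketch below relies on two standard facts: for a one-predictor model, R-squared equals the squared Pearson correlation, and for lm() the deviance is the residual sum of squares. The variable names are mine.

```r
# Values copied from the report: cor(alcohol, density) and the model deviances.
r_alcohol_density <- -0.7801376
deviance <- c(m1 = 2902.6, m2 = 1861.6, m3 = 1817.9)

# For m1 (one predictor), R-squared is the squared correlation.
r2_m1 <- r_alcohol_density^2

# Back out the total sum of squares from m1, then R-squared for every model.
tss <- deviance[["m1"]] / (1 - r2_m1)
r_squared <- 1 - deviance / tss

# Gain in R-squared from each added predictor.
gain <- diff(r_squared)

print(round(r_squared, 3))
print(round(gain, 3))
```

On these numbers, adding residual.sugar contributes most of the improvement over m1, while chlorides adds well under one percentage point.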

Alcohol is a very important property of wine and has the strongest relationship with wine quality. Since I didn’t find a linear relationship between quality and the relevant variables, I chose alcohol as the output of the linear model. However, wine brewing is a very complicated process and our dataset contains only a few physicochemical properties, so it is difficult to make this prediction more accurate.


Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Description One

My main purpose is to find which chemical properties influence the quality of wines, so I chose the quality distribution histogram as my first final plot. The distribution looks approximately normal, with a minimum of 3 and a maximum of 9; see the descriptive statistics above. Quality can be scored from 0 to 10 (worst to best). Around 75% of wines are scored 5 or 6, while wines scored 3 or 9 together make up less than 2%. There are no wines with quality below 3 or above 9 in this dataset.

Plot Two

## [1] 0.4355747
## [1] 0.4675664

Description Two

Quality has its strongest relationship with alcohol, so for the second plot I present the relationship between quality and alcohol. We can see that alcohol tends to increase across the quality range 5 to 9. However, in the quality range 3 to 5, the mean alcohol value tends to decrease. Overall there is a positive relationship between alcohol and quality, with a correlation coefficient of 0.436. When quality is restricted to the range 5 to 9, the correlation coefficient is 0.468.

Plot Three

## [1] -0.7801376

Description Three

For the last plot I chose to visualize the relationship between density and alcohol. Density has the strongest linear relationship with alcohol, with a Pearson r of -0.78. Since alcohol is less dense than water (water’s density is the horizontal line on the plot), density tends to decrease as the alcohol percentage increases. The plot also shows that wines with higher quality sit on the right side, which further illustrates that higher quality wines tend to have high alcohol and low density values.


Reflection

This dataset consists of 13 variables and 4898 observations. My main purpose is to find which chemical properties influence the quality of white wines, and at the same time to find the relationships between the other features.

First, I got to know the variables by visualizing the distribution of each one and looking for unusual behavior in the histograms, and I transformed the residual.sugar distribution with log10.

Next, I used a plot matrix to calculate and plot the correlations between the variables. None of the correlations with quality are above 0.5; the strongest is with alcohol, but its correlation coefficient is only 0.436.

I struggled to understand the relationship between quality and alcohol. I expected a linear relationship between them and even tried to build a preliminary linear model, but its accuracy never exceeded 65%. Through the bivariate visualizations, I finally found that quality and alcohol are related in two different directions: the relationship is negative for quality 3 to 5 and positive for quality 5 to 9. Eventually, I explored quality against alcohol, density and chlorides. Higher quality wines tend to have high alcohol, low residual.sugar and low chlorides values, so alcohol, density and chlorides appear to influence the quality of white wines most.

Since the plots show a linear relationship between alcohol and its related variables (density, residual.sugar and chlorides), I could build a linear model and use it to predict alcohol values.

The other challenges I experienced were mostly with the R language itself, such as how to use factors and how to reshape the dataset, but in the end I handled them all by checking the online documentation and the Help files.

After doing some research, I found that wine brewing is a very complicated process. The quality of wine is affected by many factors, such as grape variety, geographical location and climate, fermentation temperature and time, the physicochemical properties in our dataset, and more. If we had all of that information, I believe we could build a very good model to predict wine quality, and even use that model to optimize the brewing process.